Partitioning of the degradation space for OCR training
Abstract
Generally speaking, optical character recognition algorithms tend to perform better when presented with homogeneous data. This paper studies a method designed to increase the homogeneity of training data, based on an understanding of the types of degradation that occur during printing and scanning and of how these degradations affect the homogeneity of the data. While it has been shown that dividing the degradation space by edge spread improves recognition accuracy over dividing it by threshold or point spread function width alone, the challenge lies in deciding how many partitions to use and at which values of edge spread to place the divisions. Clustering of different types of character features, fonts, sizes, resolutions and noise levels shows that edge spread is indeed a strong indicator of the homogeneity of character data clusters.
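As an illustration of the partitioning question raised in the abstract, the following minimal sketch places division points along a one-dimensional edge spread axis. It is not the paper's procedure: the use of 1-D k-means, the function names (`kmeans_1d`, `boundaries`, `partition_index`) and the sample values are all assumptions made for this example, and the number of partitions `k` is still chosen by hand.

```python
from bisect import bisect_right


def boundaries(centers):
    """Division points: midpoints between adjacent (sorted) cluster centers."""
    return [(a + b) / 2 for a, b in zip(centers, centers[1:])]


def kmeans_1d(values, k, iters=100):
    """Lloyd's algorithm on scalars; returns k cluster centers in ascending order.

    `values` stands in for per-sample edge spread estimates (hypothetical data).
    """
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")
    # Initialise centers at evenly spaced quantiles of the sorted data.
    centers = [xs[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        bounds = boundaries(centers)
        groups = [[] for _ in range(k)]
        for x in xs:
            # In 1-D, the nearest center is found by locating x among the midpoints.
            groups[bisect_right(bounds, x)].append(x)
        # Recompute each center as its group mean; keep the old center if a group is empty.
        new = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if new == centers:
            break
        centers = new
    return centers


def partition_index(edge_spread, bounds):
    """Index of the partition (0 .. len(bounds)) that an edge spread value falls into."""
    return bisect_right(bounds, edge_spread)


# Hypothetical edge spread estimates for eight training samples.
samples = [0.1, 0.2, 0.15, 1.0, 1.1, 0.95, 2.0, 2.1]
centers = kmeans_1d(samples, k=3)
bounds = boundaries(centers)
```

Each training sample would then be routed to the classifier trained on partition `partition_index(value, bounds)`. Varying `k` and comparing recognition accuracy per partition is one way to explore how many divisions are worthwhile, though that choice is exactly the open question the abstract describes.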
Similar resources
Degradation Specific OCR
Optical Character Recognition (OCR) is the mechanical or electronic translation of scanned images of handwritten, typewritten, or printed text into machine-encoded text. OCR has many applications, such as making a physical text document editable, or enabling computer search of a text that was initially in printed form. OCR engines are widely used to digitize t...
DIAR: Advances in Degradation Modeling and Processing
State-of-the-art OCR/ICR algorithms and software are the result of large-scale experiments on the accuracy of OCR systems and proper selection of the size and distribution of training sets. The key factor in improving OCR technology is the degradation model. While it is a leading-edge tool for processing conventional printed materials, the degradation model now faces additional challenges as a...
Start- and end-node segmental-HMM pruning
An efficient decoding algorithm for segmental HMMs (SHMMs) is proposed with multi-stage pruning. The generation by SHMMs of a feature trajectory for each state expands the search space and the computational cost of decoding. This cost is reduced in three ways: pre-cost partitioning, start-node (SN) beam pruning, and conventional end-node (EN) beam pruning. Experiments show that partitioning cuts comput...
A Modified Self-organizing Map Neural Network to Recognize Multi-font Printed Persian Numerals (RESEARCH NOTE)
This paper proposes a new method to distinguish printed digits, regardless of font and size, using neural networks. Unlike our proposed method, existing neural network based techniques are only able to recognize the trained fonts. These methods need a large database containing digits in various fonts. New fonts are often introduced to the public, which may not be truly recognized by the Opti...
OCR of Degraded Documents using HMM-Based Techniques
We present an OCR system for handling degraded documents, such as faxed text. The basic system utilizes the BBN BYBLOS OCR system, which uses a Hidden Markov Model (HMM) approach for training and recognition. To handle degraded documents, we present two approaches, which can be applied individually or jointly. In the first approach, we train the system on documents that exhibit the expected kin...